Great Games
2022-11-20
1 Overview
Our research question is: how to audit whether labels are “conceptualized consistently” across different countries and different languages. We propose a framework to audit labels themselves for world-wide inclusivity. Here, we are not concerned with a specific ML implementation, and instead audit the labels that are being fed into systems.
Figure 1 is descriptive statistics of our survey.
- We consider country comparisons in three ways:
- Within each country, how much “internal” agreement is there on labels of interest? (Figure 2a)
- How do different populations view these labels in specific contexts? (Figure2b)
- If label outcome differences are found across countries, is there a data-backed explanation relating to culture? (Figure3)
- We consider language comparison in three ways:
- How similar are label choices for bilingual speakers when asked questions in their native language versus English? (Figure4)
- How similar are label choices for English-speakers across English-speaking countries? (Figure5)
- How similar are label choices for speakers of non-English language? (Figure6)
Figure 5 and 6 should look similar to Figure 2b using similar approach and have not been plotted currently in this doc.
bookdown::serve_book()2 Figure 1: Discriptive Statistics
source("read.R")
head(df)## # A tibble: 6 × 17
## Response.ID game country GENDER AGE ageCut native Langu…¹ region exper…²
## <chr> <chr> <fct> <chr> <int> <fct> <fct> <chr> <fct> <dbl>
## 1 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 2 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 3 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 4 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 5 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 6 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## # … with 7 more variables: hardcore <fct>, accessibility <fct>, labeler <chr>,
## # ambassador <chr>, label <fct>, answer <int>, gender <chr>, and abbreviated
## # variable names ¹Language, ²experience
## # ℹ Use `colnames()` to see all variable names
2.1 Participant breakdown by gender, ageCut, hardcore, accessibility, country, game
# 5,991 participants in total
nrow(df %>% distinct(Response.ID))## [1] 5991
# gender, age, experience, accessibility (disability) breakdowns
# gender: Male, Female, and Nonbinary.
df %>% distinct(gender, Response.ID) %>% count(gender)## # A tibble: 3 × 2
## gender n
## <chr> <int>
## 1 Female 432
## 2 Male 5497
## 3 NonBinary 62
# ageCut: 18-24, 25-34, 35-44, 45-55, 55+. Per Xbox internal age breakdowns.
df %>% distinct(ageCut, Response.ID) %>% count(ageCut)## # A tibble: 6 × 2
## ageCut n
## <fct> <int>
## 1 18-24 1182
## 2 25-34 1620
## 3 35-44 1641
## 4 45-55 773
## 5 55+ 111
## 6 <NA> 664
# hardcore: hardcore (play games over 1 time a week), non-hardcore (play games 1 time or less)
df %>% distinct(hardcore, Response.ID) %>% count(hardcore)## # A tibble: 2 × 2
## hardcore n
## <fct> <int>
## 1 hardcore 5608
## 2 non-hardcore 383
# accessibility: or really disability, Yes/No. Those people who incated that they have disability or not
df %>% distinct(accessibility, Response.ID) %>% count(accessibility)## # A tibble: 4 × 2
## accessibility n
## <fct> <int>
## 1 No 5466
## 2 Prefer not to answer 112
## 3 Yes 328
## 4 <NA> 85
# countries
df %>% distinct(country, Response.ID) %>% count(country)## # A tibble: 16 × 2
## country n
## <fct> <int>
## 1 US 3040
## 2 Germany 245
## 3 Poland 123
## 4 Greece 319
## 5 Japan 237
## 6 Korea 165
## 7 Singapore 69
## 8 India 99
## 9 Saudi.Arabia 35
## 10 South.Africa 212
## 11 Nigeria 170
## 12 Brazil 280
## 13 Argentina 182
## 14 Colombia 165
## 15 Chile 193
## 16 Mexico 457
# games
df %>% distinct(game, Response.ID) %>% count(game)## # A tibble: 11 × 2
## game n
## <chr> <int>
## 1 Animal Crossing: New Horizons 1102
## 2 Call of Duty: Vanguard 1718
## 3 Candy Crush 799
## 4 Elden Ring 1896
## 5 FIFA22 1123
## 6 Grand Theft Auto V 3168
## 7 Mario Kart 8 1651
## 8 Minecraft 2631
## 9 PUBG 1241
## 10 Stardew Valley 1043
## 11 The Sims 3 429
2.2 Label correlations
- In this plot, we look at correlation between labels from our sample. We can observe that difficulty, violent, action, action.motivation, control_complexity, strategy, learning_curve tend to go together. Comedic, creativity, pacifist, zen,made.for.kids, cozy are clustered together.
library(corrplot)## corrplot 0.92 loaded
# disregard the NA
res <- df %>%
filter(!(label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art"))) %>%
mutate(label=factor(label, levels=unique(df$label))) %>% # keep order in the final graph Figure 1
select(Response.ID, label, answer) %>%
pivot_wider(., names_from = label, values_from = answer) %>%
unnest() %>%
select(-Response.ID)
mx <- cor(res, use = "complete.obs")
corrplot(mx, type = "upper", order = "FPC", tl.cex=2,
tl.col = "black", tl.srt = 45, title="Figure 1: Label Correlation")3 Figure 2a: Consider country comparisons in three ways
This step is optional. We remove games where a country has very few response (< 5). In other words, we find games with >=5 participants across all countries.
# filter out countries with few data point
# for each game, remove countries with few data points
constraints <- df %>% count(game, country, label) %>% # group by country and label
select(game, country, n) %>% distinct() %>%
filter(n < 5)
# remove games that have too few response in any country
df <- df %>% filter(!(game %in% unique(constraints$game))) %>%
filter(!(label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art")))
# game included
unique(df$game)## [1] "PUBG" "Grand Theft Auto V" "Minecraft"
## [4] "Elden Ring" "FIFA22" "Call of Duty: Vanguard"
3.1 Within each country, how much “internal” agreement is there on labels of interest?
We seek to answer questions such as does the concept of a “cozy game” elicit more agreement within the US than it does in Japan? To achieve so, we calculate the average standard deviation/variance of label outcomes and comparing across countries. We plot standard deviation to reduce the length of this document.
Code Description: plotFigure2a is a wrapper function that plots all 28 label output. The first three are 1-3 scale, 1-4 scale for difficulty the latter 24 are 0-1 scale. plotVariance is the plotting function.
# wrapper function that enable individual game
plotFigure2a <- function(game = "All", sd=FALSE, debug=FALSE) {
var.df <- df %>% group_by(game, country, label) %>%
summarise(mean_answer = mean(answer, na.rm=TRUE),
variance = var(answer, na.rm=TRUE),
sd=sd(answer, na.rm=TRUE))
if(game == "All" && !sd) {
# average the variances across games for each country
country.df <- var.df %>% group_by(country, label) %>%
summarise(y=mean(variance, na.rm=TRUE))
} else if (game == "All" && sd) {
country.df <- var.df %>% group_by(country, label) %>%
summarise(y=mean(sd, na.rm=TRUE))
} else if(game != "All" && !sd) {
country.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=variance)
} else if (game != "All" && sd) {
country.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=sd)
} else {
print("Error...")
}
p1 <- plotVariance(country.df, gameTitle=game, sd=sd, binary=FALSE)
p2 <- plotVariance(country.df, gameTitle=game, sd=sd, binary=TRUE)
res <- ggarrange(p1, p2, ncol=1, heights = c(1, 6))
return(res)
}
# plot variance of a label across countries
plotVariance <- function(inputDf, gameTitle="All", sd=FALSE, binary=FALSE) {
# color the US
inputDf$color <- inputDf$country == "US"
plotDf <- inputDf[with(inputDf, order(label, country, y)),] %>%
filter(!label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
if(!binary) {
plotDf <- inputDf[with(inputDf, order(label, country, y)),] %>%
filter(label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
}
if(sd) {
yLabel <- "Standard Devaition"
} else {
yLabel <- "Variance"
}
p <- plotDf %>% ggplot(aes(x=reorder_within(country, y, label), y=y, color=color)) +
geom_point() +
scale_x_reordered() +
facet_wrap(~label, ncol=4, scales = "free") +
ylim(c(0, 3)) +
scale_color_manual(values=c("#999999", "#56B4E9")) +
labs(y = yLabel, x = NULL,
title = paste0("How much internal agreement is there on labels of interest (", gameTitle, ")"))
if(binary) {
p <- p + ylim(c(0, 1))
}
return(p)
}3.1.1 Taken all games together.
Here we look at the standard deviation across all the games. We look at all the games together. (e.g., Does the concept of a “cozy game” elicit more agreement within the US than it does in Japan regardless of games?) The US is highlighted in the graph in Figure 2a.
We can observe that Saudi Arabia with the highest variability (i.e., standard deviation) 11 times, South Africa 2 times, Nigeria 7 times, Singapore 4 times, Japan 1 time, India 1 time, Mexico 1 time, and Colombia 1 time.
plotFigure2a("All", sd=TRUE, debug=FALSE)3.1.2 Breakdown by game
Here we look at the variance/standard deviation in individual games. We look at all the games together. (e.g., Does the concept of a “cozy game” elicit more agreement within the US than it does in Japan in specific games?)
Note that not in one time did the US has the highest standard deviation.
3.1.2.1 PUBG
plotFigure2a("PUBG", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.1.2.2 Grand Theft Auto V
plotFigure2a("Grand Theft Auto V", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.1.2.3 Minecraft
plotFigure2a("Minecraft", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.1.2.4 Elden Ring
plotFigure2a("Elden Ring", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.1.2.5 FIFA22
plotFigure2a("FIFA22", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.1.2.6 Call of Duty: Vanguard
plotFigure2a("Call of Duty: Vanguard", sd=TRUE, debug=FALSE)## `summarise()` has grouped output by 'game', 'country'. You can override using
## the `.groups` argument.
## Scale for 'y' is already present. Adding another scale for 'y', which will
## replace the existing scale.
3.2 Within each region, how much “internal” agreement is there on labels of interest?
In this case, we group countries to higher-level regions based on the World Value Survey map. We perform the same analysis but on a regional level.
# wrapper function that enable individual game
plotFigure2aRegion <- function(game = "All", sd=FALSE, debug=FALSE) {
var.df <- df %>% group_by(game, region, label) %>%
summarise(mean_answer = mean(answer, na.rm=TRUE),
variance = var(answer, na.rm=TRUE),
sd=sd(answer, na.rm=TRUE))
if(game == "All" && !sd) {
# average the variances across games for each region
region.df <- var.df %>% group_by(region, label) %>%
summarise(y=mean(variance, na.rm=TRUE))
} else if (game == "All" && sd) {
region.df <- var.df %>% group_by(region, label) %>%
summarise(y=mean(sd, na.rm=TRUE))
} else if(game != "All" && !sd) {
region.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=variance)
} else if (game != "All" && sd) {
region.df <- var.df %>% filter(game == {{ game }}) %>% mutate(y=sd)
} else {
print("Error...")
}
p1 <- plotVariance(region.df, gameTitle=game, sd=sd, binary=FALSE)
p2 <- plotVariance(region.df, gameTitle=game, sd=sd, binary=TRUE)
res <- ggarrange(p1, p2, ncol=1, heights = c(1, 6))
return(res)
}
# plot variance of a label across countries
plotVariance <- function(inputDf, gameTitle="All", sd=FALSE, binary=FALSE) {
# color the US
inputDf$color <- inputDf$region == "US"
plotDf <- inputDf[with(inputDf, order(label, region, y)),] %>%
filter(!label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
if(!binary) {
plotDf <- inputDf[with(inputDf, order(label, region, y)),] %>%
filter(label %in% c("control_complexity", "learning_curve", "difficulty", "replayability"))
}
if(sd) {
yLabel <- "Standard Devaition"
} else {
yLabel <- "Variance"
}
p <- plotDf %>% ggplot(aes(x=reorder_within(region, y, label), y=y, color=color)) +
geom_point() +
scale_x_reordered() +
facet_wrap(~label, ncol=4, scales = "free") +
ylim(c(0, 3)) +
scale_color_manual(values=c("#999999", "#56B4E9")) +
labs(y = yLabel, x = NULL,
title = paste0("How much internal agreement is there on labels of interest (", gameTitle, ")"))
if(binary) {
p <- p + ylim(c(0, 1))
}
return(p)
}4 Figure 2b: How do different populations view these labels in specific contexts?
To answer this question, we calculate differences in estimated mean label outcomes for each demographic using causal matching analysis. To predict the probability of having a game label in a certain country, we use multilevel regression and post-stratification (MRP).
# Import libraries
source("read.R")
library(ggalt)
library(reshape2)
options(dplyr.summarise.inform = FALSE)4.1 Calculate differences in estimated mean label outcomes for each demographic using causal matching analysis
We first check the differences between means for each labels across countries below. We run the following analyzeCountry function to each label and explores whether there is a significant difference between US and non-US participants in our matched dataframe.
analyzeCountry <- function(inputDf, label, lower=1, upper=3) {
df <- inputDf %>% filter(label == {{ label }}) %>%
mutate(is_us = ifelse(country == "US", 1, 0)) %>%
mutate(gender.no.nonbinary = ifelse(gender == "Male", 1, 0)) %>% # depending on how we want to deal with this
mutate(gender.no.nonbinary = factor(gender.no.nonbinary)) %>%
na.omit()
# causal matching using matchit
m.out <- matchit(is_us ~ game + ageCut + hardcore + gender.no.nonbinary, method = "nearest", distance = "mahalanobis", link = "probit", replace = TRUE, data=df)
# optional for debugging
s.out <- summary(m.out, standardize = TRUE)
plot(s.out)
matched.df <- match.data(m.out)
matched.df$is_us <- factor(matched.df$is_us)
model <- lm(answer ~ is_us, data=matched.df)
print(summary(model))
annotator <- df %>% filter(labeler == "Yes" & label == {{ label }}) %>%
count(game, answer) %>%
group_by(game) %>%
summarize(majority.vote = mean(answer[which(n==max(n))])) %>% ungroup()
p <- plotByGame(matched.df, "is_us", label, LOWER=lower, UPPER = upper) +
geom_point(aes(x=game, y=majority.vote, color="Polish Annotator", shape="Polish Annotator"), size=2, data=annotator) + # majority.vote is the Polish annotator majority vote
labs(title=paste0(label, " Score by Country"), y="Mean Score", x="Games") +
scale_color_discrete("Country")
return(p)
}- In the remainder of this section, we look at each individual label. For each label, we can look at whether there is a significant difference between US and non-US participants. For example, learning curve, replaybility, zen, space, violent, action, comedic, grinding, anime, motivation:action, motivation:social, motivation:immersion are significantly different.
4.1.1 Control Complexity
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6644 -0.6257 0.3356 0.3743 1.3743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.66443 0.05624 29.597 <2e-16 ***
## is_us1 -0.03875 0.05673 -0.683 0.495
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6865 on 8669 degrees of freedom
## Multiple R-squared: 5.384e-05, Adjusted R-squared: -6.151e-05
## F-statistic: 0.4667 on 1 and 8669 DF, p-value: 0.4945
4.1.2 Learning Curve
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7651 -0.6227 -0.6227 0.3773 1.3773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.76510 0.06333 27.871 <2e-16 ***
## is_us1 -0.14236 0.06388 -2.228 0.0259 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7731 on 8669 degrees of freedom
## Multiple R-squared: 0.0005725, Adjusted R-squared: 0.0004572
## F-statistic: 4.966 on 1 and 8669 DF, p-value: 0.02588
4.1.3 Difficulty
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90894 -0.90894 0.09106 0.10067 2.10067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.899329 0.076427 24.852 <2e-16 ***
## is_us1 0.009613 0.077092 0.125 0.901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9329 on 8669 degrees of freedom
## Multiple R-squared: 1.793e-06, Adjusted R-squared: -0.0001136
## F-statistic: 0.01555 on 1 and 8669 DF, p-value: 0.9008
4.1.4 Replayability
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.4363 -0.4363 0.5637 0.5637 0.7047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.29530 0.05902 38.892 <2e-16 ***
## is_us1 0.14098 0.05953 2.368 0.0179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7204 on 8669 degrees of freedom
## Multiple R-squared: 0.0006465, Adjusted R-squared: 0.0005313
## F-statistic: 5.608 on 1 and 8669 DF, p-value: 0.0179
4.1.5 Pacifist
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2013 -0.1641 -0.1641 -0.1641 0.8359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20134 0.03039 6.626 3.65e-11 ***
## is_us1 -0.03730 0.03065 -1.217 0.224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3709 on 8669 degrees of freedom
## Multiple R-squared: 0.0001708, Adjusted R-squared: 5.544e-05
## F-statistic: 1.481 on 1 and 8669 DF, p-value: 0.2237
4.1.6 Made for kids
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3557 -0.3202 -0.3202 0.6798 0.6798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35570 0.03824 9.301 <2e-16 ***
## is_us1 -0.03547 0.03858 -0.920 0.358
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4668 on 8669 degrees of freedom
## Multiple R-squared: 9.754e-05, Adjusted R-squared: -1.781e-05
## F-statistic: 0.8456 on 1 and 8669 DF, p-value: 0.3578
4.1.7 Cozy
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3044 -0.3044 -0.3044 0.6956 0.7248
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.27517 0.03768 7.302 3.08e-13 ***
## is_us1 0.02922 0.03801 0.769 0.442
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.46 on 8669 degrees of freedom
## Multiple R-squared: 6.817e-05, Adjusted R-squared: -4.718e-05
## F-statistic: 0.591 on 1 and 8669 DF, p-value: 0.4421
4.1.8 Zen
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2773 -0.2773 -0.2773 0.7227 0.7920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20805 0.03662 5.681 1.38e-08 ***
## is_us1 0.06923 0.03694 1.874 0.061 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.447 on 8669 degrees of freedom
## Multiple R-squared: 0.000405, Adjusted R-squared: 0.0002897
## F-statistic: 3.512 on 1 and 8669 DF, p-value: 0.06096
4.1.9 Fantasy
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3340 -0.3340 -0.3340 0.6660 0.6711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.32886 0.03864 8.511 <2e-16 ***
## is_us1 0.00510 0.03898 0.131 0.896
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4716 on 8669 degrees of freedom
## Multiple R-squared: 1.975e-06, Adjusted R-squared: -0.0001134
## F-statistic: 0.01712 on 1 and 8669 DF, p-value: 0.8959
4.1.10 Space
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08054 -0.01877 -0.01877 -0.01877 0.98123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.08054 0.01141 7.061 1.78e-12 ***
## is_us1 -0.06176 0.01150 -5.368 8.15e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1392 on 8669 degrees of freedom
## Multiple R-squared: 0.003313, Adjusted R-squared: 0.003198
## F-statistic: 28.82 on 1 and 8669 DF, p-value: 8.151e-08
4.1.11 Heroic
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2537 -0.2537 -0.2537 0.7463 0.8054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19463 0.03560 5.467 4.69e-08 ***
## is_us1 0.05907 0.03591 1.645 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4345 on 8669 degrees of freedom
## Multiple R-squared: 0.000312, Adjusted R-squared: 0.0001967
## F-statistic: 2.706 on 1 and 8669 DF, p-value: 0.1
4.1.12 Real World
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3624 -0.3545 -0.3545 0.6455 0.6455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.362416 0.039197 9.246 <2e-16 ***
## is_us1 -0.007922 0.039538 -0.200 0.841
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4785 on 8669 degrees of freedom
## Multiple R-squared: 4.631e-06, Adjusted R-squared: -0.0001107
## F-statistic: 0.04014 on 1 and 8669 DF, p-value: 0.8412
4.1.13 Violent
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4066 -0.4066 -0.4066 0.5934 0.8054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19463 0.04012 4.851 1.25e-06 ***
## is_us1 0.21196 0.04047 5.237 1.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4898 on 8669 degrees of freedom
## Multiple R-squared: 0.003154, Adjusted R-squared: 0.003039
## F-statistic: 27.43 on 1 and 8669 DF, p-value: 1.669e-07
4.1.14 Action
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4703 -0.4703 -0.4703 0.5297 0.6778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.32215 0.04085 7.886 3.5e-15 ***
## is_us1 0.14816 0.04121 3.596 0.000325 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4986 on 8669 degrees of freedom
## Multiple R-squared: 0.001489, Adjusted R-squared: 0.001374
## F-statistic: 12.93 on 1 and 8669 DF, p-value: 0.0003253
4.1.15 Emotional
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2081 -0.1907 -0.1907 -0.1907 0.8093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.20805 0.03221 6.460 1.1e-10 ***
## is_us1 -0.01737 0.03249 -0.535 0.593
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3931 on 8669 degrees of freedom
## Multiple R-squared: 3.298e-05, Adjusted R-squared: -8.237e-05
## F-statistic: 0.2859 on 1 and 8669 DF, p-value: 0.5929
4.1.16 Comedic
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4488 -0.4488 -0.4488 0.5512 0.6309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36913 0.04073 9.063 <2e-16 ***
## is_us1 0.07971 0.04109 1.940 0.0524 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4972 on 8669 degrees of freedom
## Multiple R-squared: 0.000434, Adjusted R-squared: 0.0003187
## F-statistic: 3.764 on 1 and 8669 DF, p-value: 0.0524
4.1.17 Experimental
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.1745 -0.1632 -0.1632 -0.1632 0.8368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17450 0.03029 5.760 8.69e-09 ***
## is_us1 -0.01127 0.03056 -0.369 0.712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3698 on 8669 degrees of freedom
## Multiple R-squared: 1.57e-05, Adjusted R-squared: -9.966e-05
## F-statistic: 0.1361 on 1 and 8669 DF, p-value: 0.7122
4.1.18 Strategy
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4488 -0.4488 -0.4488 0.5512 0.5571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.442953 0.040750 10.870 <2e-16 ***
## is_us1 0.005885 0.041105 0.143 0.886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4974 on 8669 degrees of freedom
## Multiple R-squared: 2.365e-06, Adjusted R-squared: -0.000113
## F-statistic: 0.0205 on 1 and 8669 DF, p-value: 0.8862
4.1.19 Grinding
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5116 -0.5116 0.4884 0.4884 0.6510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.34899 0.04092 8.528 < 2e-16 ***
## is_us1 0.16262 0.04128 3.940 8.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4995 on 8669 degrees of freedom
## Multiple R-squared: 0.001787, Adjusted R-squared: 0.001672
## F-statistic: 15.52 on 1 and 8669 DF, p-value: 8.227e-05
4.1.20 Anime
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17450 -0.07897 -0.07897 -0.07897 0.92103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.17450 0.02228 7.831 5.40e-15 ***
## is_us1 -0.09552 0.02248 -4.250 2.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.272 on 8669 degrees of freedom
## Multiple R-squared: 0.002079, Adjusted R-squared: 0.001964
## F-statistic: 18.06 on 1 and 8669 DF, p-value: 2.16e-05
4.1.21 Hand drawn
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14094 -0.09939 -0.09939 -0.09939 0.90061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14094 0.02459 5.732 1.02e-08 ***
## is_us1 -0.04155 0.02480 -1.675 0.0939 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3001 on 8669 degrees of freedom
## Multiple R-squared: 0.0003237, Adjusted R-squared: 0.0002083
## F-statistic: 2.807 on 1 and 8669 DF, p-value: 0.09391
4.1.22 Stylized
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5394 -0.5394 0.4606 0.4606 0.4832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.51678 0.04084 12.65 <2e-16 ***
## is_us1 0.02265 0.04120 0.55 0.582
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4985 on 8669 degrees of freedom
## Multiple R-squared: 3.487e-05, Adjusted R-squared: -8.048e-05
## F-statistic: 0.3023 on 1 and 8669 DF, p-value: 0.5825
4.1.23 Motivation: Action
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5419 -0.5419 0.4581 0.4581 0.6067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.39333 0.04067 9.671 < 2e-16 ***
## is_us1 0.14860 0.04103 3.622 0.000294 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4981 on 8649 degrees of freedom
## Multiple R-squared: 0.001514, Adjusted R-squared: 0.001399
## F-statistic: 13.12 on 1 and 8649 DF, p-value: 0.0002941
4.1.25 Motivation: Mastery
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4607 -0.4607 -0.4607 0.5393 0.5800
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42000 0.04070 10.32 <2e-16 ***
## is_us1 0.04065 0.04105 0.99 0.322
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4984 on 8649 degrees of freedom
## Multiple R-squared: 0.0001134, Adjusted R-squared: -2.252e-06
## F-statistic: 0.9805 on 1 and 8649 DF, p-value: 0.3221
4.1.26 Motivation: Achievement
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5292 -0.5292 0.4708 0.4708 0.5000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50000 0.04076 12.267 <2e-16 ***
## is_us1 0.02923 0.04112 0.711 0.477
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4992 on 8649 degrees of freedom
## Multiple R-squared: 5.843e-05, Adjusted R-squared: -5.718e-05
## F-statistic: 0.5054 on 1 and 8649 DF, p-value: 0.4772
4.1.27 Motivation: Immersion
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4948 -0.4948 -0.3733 0.5052 0.6267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.37333 0.04080 9.149 < 2e-16 ***
## is_us1 0.12143 0.04116 2.950 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4998 on 8649 degrees of freedom
## Multiple R-squared: 0.001005, Adjusted R-squared: 0.0008897
## F-statistic: 8.703 on 1 and 8649 DF, p-value: 0.003186
4.1.28 Motivation: Creativity
##
## Call:
## lm(formula = answer ~ is_us, data = matched.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4567 -0.4567 -0.4567 0.5433 0.5733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.42667 0.04067 10.491 <2e-16 ***
## is_us1 0.02999 0.04103 0.731 0.465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4981 on 8649 degrees of freedom
## Multiple R-squared: 6.175e-05, Adjusted R-squared: -5.386e-05
## F-statistic: 0.5341 on 1 and 8649 DF, p-value: 0.4649
5 Figure X. Poststratification and Correlations
To predict the probability of having a game label in a certain country, we use multilevel regression and post-stratification (MRP).
5.1 Set up
# Import libraries
source("read.R")
library(ggalt)
library(reshape2)
library(lme4)
options(dplyr.summarise.inform = FALSE)In this subsection, we try to predict a probability of labeling a certain game with a particular Genome label by different countries.
# read csv
countryLevelDf <- read_csv("demographic.csv")## Rows: 7167 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country, AgeGroup, Gender
## dbl (1): UserCount
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
countryLevelDf$UserCount <- as.numeric(countryLevelDf$UserCount)
# make the country name consistent with the individual survey dataframe
countryLevelDf <- countryLevelDf %>%
filter(Country %in% c("Argentina", "Brazil", "Chile", "Colombia", "Germany", "Greece", "India", "Japan", "Korea", "Mexico", "Nigeria", "Poland", "Saudi Arabia", "Singapore", "South Africa", "United States")) %>%
dplyr::rename(ageCut = AgeGroup, country = Country, gender = Gender) %>%
mutate(ageCut = case_when(ageCut == "18 - 24" ~ "18-24",
ageCut == "25 - 34" ~ "25-34",
ageCut == "35 - 44" ~ "35-44",
ageCut == "45 - 55" ~ "45-55",
ageCut == "> 55" ~ "55+"))
countryLevelDf$country[countryLevelDf$country == "United States"] <- "US"
countryLevelDf$country[countryLevelDf$country == "Saudi Arabia"] <- "Saudi.Arabia"
countryLevelDf$country[countryLevelDf$country == "South Africa"] <- "South.Africa"
countryLevelDf <- countryLevelDf %>% mutate(region = case_when(country %in% c("Japan", "Korea") ~ "Confucian",
country %in% c("Singapore", "India") ~ "West.South.Asia",
country %in% c("Argentina", "Chile", "Colombia", "Mexico", "Brazil") ~ "Latin.America",
country %in% c("Germany") ~ "Protestant.Europe",
country %in% c("South.Africa", "Nigeria", "Saudi.Arabia") ~ "Islamic",
country %in% c("Poland") ~ "Catholic.Europe",
country %in% c("Greece") ~ "Orthodox.Europe",
country %in% c("US") ~ "US"))
countryLevelDf$country <- factor(countryLevelDf$country)
countryLevelDf$ageCut <- factor(countryLevelDf$ageCut, c("18-24","25-34", "35-44", "45-55", "55+"))
countryLevelDf$gender <- factor(countryLevelDf$gender)
countryLevelDf$region = factor(countryLevelDf$region)
# Calculate the percentage of age+gender group by country
countryLevelDf <- countryLevelDf %>% group_by(country) %>%
mutate(UserCount.per = UserCount/sum(UserCount, na.rm=TRUE))
countryLevelDf <- countryLevelDf %>% filter(ageCut != "" & ageCut != "Unknown" & !is.na(ageCut)) %>%
filter(gender != "" & gender != "Unknown" & !is.na(gender)) %>%
mutate(gender = ifelse(gender == "Male", "Male", "Female or Nonbinary"))5.2 Exploratory Data Analysis
We examine the difference between our sample of ~5000 participants and the true MS population.
df <- df %>% mutate(gender = ifelse(gender == "Male", "Male", "Female or Nonbinary"))
age_sample <- df %>%
dplyr::mutate(age = factor(ageCut, ordered = FALSE)) %>%
dplyr::group_by(age) %>%
dplyr::summarise(n = n()) %>%
dplyr::mutate(Sample = n/sum(n))
age_post <- countryLevelDf %>%
dplyr::mutate(age = factor(ageCut, ordered = FALSE)) %>%
dplyr::group_by(age) %>%
dplyr::summarise(n_post = sum(UserCount)) %>%
dplyr::mutate(Population = n_post/sum(n_post))
age <- dplyr::inner_join(age_sample, age_post, by = "age") %>% select(age, Sample, Population)
age_plot <- ggplot() +
ylab("") + xlab("Proportion") + theme_bw() + coord_flip() +
geom_dumbbell(data = age, aes(y = age, x = Sample, xend = Population)) +
geom_point(data = reshape2::melt(age, id = "age"), aes(y = age, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 0.5), breaks = c(0, .1, .2, .3, .4, .5)) +
ggtitle("Age")
age_plot# Gender because the poststratification dataset does not have nonbinary
male_sample <- df %>%
dplyr::mutate(gender = ifelse(gender != "Male", "Female or Nonbinary", "Male")) %>%
dplyr::group_by(gender) %>%
dplyr::summarise(n = n()) %>%
dplyr::mutate(Sample = n/sum(n))
male_post <- countryLevelDf %>%
dplyr::group_by(gender) %>%
dplyr::summarise(n_post = sum(UserCount)) %>%
dplyr::mutate(Population = n_post/sum(n_post))
male <- dplyr::inner_join(male_sample, male_post, by = "gender") %>% select(gender, Sample, Population)
male_plot <- ggplot() +
ylab("") + xlab("") + theme_bw() + coord_flip() +
geom_dumbbell(data = male, aes(y = gender, x = Sample, xend = Population)) +
geom_point(data = reshape2::melt(male, id = "gender"), aes(y = gender, x = value, color = variable), size = 2) +
scale_x_continuous(limits = c(0, 1), breaks = c(0, .2, .4, .6, .8, 1.0)) + ggtitle("Gender")
male_plot5.3 Poststratification for country
Below, I will show a Multi-level regression and poststratification (MRP) on one label (violent) and one game (Grand Theft Auto V) as an example. I will use the same approach to generate a prediction for all label-game pair for different countries. In the end, I will use the predicted value as element in the label vector to compare with Hofstede’s cultural dimensions.
5.3.1 MPR Example for one label and one game
5.3.1.1 Prep the groundtruth dataset
# First we need to prep the groundtruth dataset when merging the groundtruth countryLevelDf for post-stratification
# Make the dataset from a long format to a wide format
wide.df <- countryLevelDf %>% dplyr::select(country, gender, ageCut, UserCount) %>%
pivot_wider(names_from = country, values_from = UserCount)
wide.df <- wide.df[with(wide.df, order(gender, ageCut)),]
wide.df <- wide.df %>% dplyr::select(-ageCut, -gender)
# Ascertain the order of the random variables
country.order <- sort(colnames(wide.df))
age.order <- unique(df$ageCut[!is.na(df$ageCut)])
gender.order <- unique(df$gender[!is.na(df$gender)])# Filter a smaller dataset and generate the model
violent <- df %>% filter(label == "violent" & game == "Grand Theft Auto V")
mod <- glmer(answer ~ 1 + (1|gender) + (1|ageCut) + (1|country), data=violent, family = binomial("probit"))5.3.1.2 Generate Prediction for ideal types (5 ageCuts * 2 genders * 16 countries)
# We have 10 individual level (5 ageCut * 2 gender) living in 16 countries. There are 10*16 specific ideal types.
# Reference of the implementation: https://github.com/lleemann/MrP_chapter/blob/master/MrP_Illsutration.pdf
re.gender.coef <- ranef(mod)$gender
re.age.coef <- ranef(mod)$ageCut
re.country.coef <- ranef(mod)$country
rownames(re.gender.coef) <- gender.order
rownames(re.age.coef) <- age.order
rownames(re.country.coef) <- country.order
re.gender <- re.gender.coef[[1]]
re.ageCut <- re.age.coef[[1]]
re.country <- re.country.coef[[1]]
gender.re <- rep(re.gender, 5)
age.re <- rep(kronecker(re.ageCut, c(1)), 1)
ind.re <- rowSums(cbind(gender.re, age.re))
ind.re <- ind.re + fixef(mod)
y.lat <- rep(NA, 16*10)
for(i in 1:16) {
a <- ((i-1)*10)+1
b <- a + 9
y.lat[a:b] <- ind.re + re.country[i]
}
# Here we use pnorm to transform the scores in y.lat1 to predicted probability
# In other words, what we have doe was on the latent variable, but we are really interested in the predicted probability
p1 <- pnorm(y.lat)
# p1 is the vector for 10 individual levels in 16 countries
# It's essentially the random effect coefficients
length(p1) ## [1] 160
5.3.1.3 Post-stratify by ideal types
# Next, we can post-stratify and make predictions
a <- c(data.matrix(wide.df)) # get the number of groudtruth user count in the 160 categories above
# Post-stratified number for a label for a game (in this case "violent", GTA V)
# This number shows the prediction regardless of countries: pretty consistent hah
sum(p1*a)/sum(a)## [1] 0.7234758
# Now we post-stratify the individual levels (5 ageCut * 2 gender) for each country
country.pred <- rep(NA, 16)
for(i in 1:16) {
# a1, a2 are indices
a1 <- ((i-1)*10) + 1
a2 <- a1 + 9
p1 <- pnorm(y.lat[a1:a2])
a <- wide.df[,i] # We don't need to worry about the order because we have considered the order when we generate y.lat1
country.pred[i] <- sum(p1*a) / sum(a)
}
# Chile is really not thinking that GTA is a violent game
plot.df <- data.frame(country = factor(colnames(wide.df), levels=colnames(wide.df)), predicted=country.pred)
ggplot(aes(x=reorder(country, predicted), y=predicted), data=plot.df) +
geom_point()5.3.2 MRP for all label-game pairs for each country
The goal here is to create a dataframe with columns: label, game, country, glm_pred, mrp_adjusted
# Let's consider 0-1 vector first
labels <- c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
games <- unique(df$game)
# This function repeats what we have done in the example above
getMRPPrediction <- function(mod) {
re.gender.coef <- ranef(mod)$gender
re.age.coef <- ranef(mod)$ageCut
re.country.coef <- ranef(mod)$country
rownames(re.gender.coef) <- gender.order
rownames(re.age.coef) <- age.order
rownames(re.country.coef) <- country.order
re.gender <- re.gender.coef[[1]]
re.ageCut <- re.age.coef[[1]]
re.country <- re.country.coef[[1]]
gender.re <- rep(re.gender, 5)
age.re <- rep(kronecker(re.ageCut, c(1)), 1)
ind.re <- rowSums(cbind(gender.re, age.re))
ind.re <- ind.re + fixef(mod)
y.lat <- rep(NA, 16*10)
for(i in 1:16) {
a <- ((i-1)*10)+1
b <- a + 9
y.lat[a:b] <- ind.re + re.country[i]
}
p1 <- pnorm(y.lat)
# post-stratification prediction
country.pred <- rep(NA, 16)
for(i in 1:16) {
a1 <- ((i-1)*10) + 1
a2 <- a1 + 9
p1 <- pnorm(y.lat[a1:a2])
a <- wide.df[,i]
country.pred[i] <- sum(p1*a) / sum(a)
}
# append MRP prediction
output.df <- data.frame(country = factor(colnames(wide.df), levels=country.order), mrp_pred=country.pred)
return(output.df)
}
getGLMPrediction = function(mod) {
pred <- expand.grid(country=country.order)
pred$glm_pred <- predict(mod, newdata=pred, type="response", allow.new.levels=TRUE)
return(pred)
}
# Here we create the final.df which is the post-stratified prediction for each country
final.df <- data.frame(label=c(), country=c(), game=c(), glm_pred=c(), mrp_adjusted=c())
for(label in labels){
for(game in games){
label.df <- df %>% filter(label == {{ label }} & game == {{ game }})
mod.glm <- glm(answer ~ country, data=label.df, family = binomial(link="logit"))
mod.mrp <- glmer(answer ~ 1 + (1|gender) + (1|ageCut) + (1|country), data=label.df, family = binomial("probit"))
glm.pred <- getGLMPrediction(mod.glm)
mrp.pred <- getMRPPrediction(mod.mrp)
mrp.pred <- mrp.pred %>% left_join(glm.pred, by="country")
mrp.pred$game <- game
mrp.pred$label <- label
final.df <- rbind(mrp.pred, final.df)
}
}
# post-stratified dataset
head(final.df)## country mrp_pred glm_pred game label
## 1 South.Africa 0.8286466 0.7692308 The Sims 3 creativity
## 2 Colombia 0.8376679 0.7500000 The Sims 3 creativity
## 3 Greece 0.8194868 0.8666667 The Sims 3 creativity
## 4 Korea 0.8249671 0.9999999 The Sims 3 creativity
## 5 India 0.8325641 0.9999999 The Sims 3 creativity
## 6 Japan 0.8127111 0.6250000 The Sims 3 creativity
## We need to filter out country with very low survey responses
filter.df <- df %>% group_by(label, country, game) %>%
filter(n() >= 5) %>%
count()
final.df <- final.df %>%
left_join(filter.df, by=c("country", "game", "label")) %>%
mutate(mrp_pred=ifelse(n < 5, NA, final.df$mrp_pred)) %>%
mutate(glm_pred=ifelse(n < 5, NA, final.df$glm_pred))head(final.df)## country mrp_pred glm_pred game label n
## 1 South.Africa 0.8286466 0.7692308 The Sims 3 creativity 13
## 2 Colombia 0.8376679 0.7500000 The Sims 3 creativity 12
## 3 Greece 0.8194868 0.8666667 The Sims 3 creativity 15
## 4 Korea 0.8249671 0.9999999 The Sims 3 creativity 6
## 5 India NA NA The Sims 3 creativity NA
## 6 Japan 0.8127111 0.6250000 The Sims 3 creativity 8
5.3.3 Heatmap
# The post-stratified prediction
final.df %>% filter(game == "Animal Crossing: New Horizons") %>%
ggplot(aes(x=country, y=factor(label, levels=rev(labels)), fill=mrp_pred)) +
geom_tile() +
geom_text(aes(label = round(mrp_pred, 2)), color="white", size=1) +
ggtitle("Animal Crossing: New Horizons")## Warning: Removed 72 rows containing missing values (geom_text).
5.3.4 Comparing post-stratified results with the Hofstede’s Dimensions.
5.3.4.1 Setup Hoftstede Cultural Dimensions.
Now we have the post-stratified results, we should compare with Hoftstede Cultural Dimensions.
# We create a dataframe where we calculate the cosine similarity between two countries
hofstedeDf <- read.csv("hoftstede.csv")
hofstedeDf <- hofstedeDf %>% select(-X) %>%
mutate(power.distance = power.distance/100,
individualism = individualism/100,
masculinity = masculinity/100,
uncertainty = uncertainty/100,
long.term.orientation = long.term.orientation/100,
indulgence = indulgence/100) # scale the hofstede score to 0 - 1
# Create a matrix with cosine similarity between countries
hofstedeMatrix = as.matrix(hofstedeDf[,2:7])
m <- dist(hofstedeMatrix, method = "euclidean", diag = TRUE, upper = TRUE, p = 2) # switch to Euclidean distance
m <- as.matrix(m, nrow=16, ncol=16)
rownames(m) <- hofstedeDf$country
colnames(m) <- hofstedeDf$country
cultureDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], culture.euclidean=c(m))
head(cultureDf)## countryA countryB culture.euclidean
## 1 India India 0.0000000
## 2 Singapore India 0.5605355
## 3 US India 0.8330066
## 4 Nigeria India 0.8075890
## 5 South.Africa India 0.6466065
## 6 Argentina India 0.7785242
5.3.4.2 Different distance metric
In this section, we will try both the Euclidean distance and correlation of labels. To compare each country, we will choose a 24-dimension vector that represents the post-stratified probability prediction for the 24 label for a game. We compare that distance (i.e., Euclidean or correlation) with the Hofstede’s Euclidean distance.
# Euclidean distance calculation
calEuclidean <- function(full, label) {
# mx <- cor(full, use = "pairwise.complete.obs")
mx <- t(as.matrix(full))
m <- dist(mx, method = "euclidean", diag = TRUE, upper = TRUE, p = 2)
m <- as.matrix(m, nrow=16, ncol=16)
retDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], label.euclidean=c(m))
retDf$label <- label
return(retDf)
}
# Correlation calculation
calCorr <- function(full, label) {
# mx <- cor(full, use = "pairwise.complete.obs")
mx <- as.matrix(full)
m <- cor(mx, method = "pearson", use = "pairwise.complete.obs")
retDf <- data.frame(countryA=rownames(m)[row(m)], countryB=colnames(m)[col(m)], label.corr=c(m))
retDf$label <- label
return(retDf)
}5.3.4.3 Euclidean distance
## Euclidean distance
# This is calculating the distance between label vectors (each dot is represented by labels)
# labels currently using the binary indicator variable
countries <- unique(df$country)
gameOrder <- unique(df$game[!is.na(df$game)])
labelOrder <- c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output.game.ed <- data.frame(countryA=c(), countryB=c(), label.cos.sim=c(), label=c())
for(game in gameOrder) {
gameSpace <- data.frame(matrix(ncol=0, nrow=24))
for(country in countries) {
df.ed <- final.df %>% filter(country == !!country & game == !!game)
df.ed <- df.ed %>% arrange(factor(label, levels=labelOrder))
v <- df.ed$mrp_pred
gameSpace <- cbind(gameSpace, as.data.frame(v))
names(gameSpace)[names(gameSpace) == "v"] <- country
}
output.game.ed <- rbind(calEuclidean(gameSpace, game), output.game.ed)
}# Plot the results
filtered.ed <- cultureDf %>% left_join(output.game.ed, by=c("countryA", "countryB")) %>%
filter(countryA != countryB) %>% # remove duplicates (countryA, countryB) vs (countryB, countryA)
filter(!duplicated(paste0(pmax(countryA, countryB), pmin(countryA, countryB), label)))
# Overall for all games (~11) * country pairs (120) points
filtered.ed %>% ggplot(aes(x=culture.euclidean, y=label.euclidean)) +
geom_point() +
geom_smooth(method="lm")## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Warning: Removed 157 rows containing missing values (geom_point).
# Spearman correlation, the coefficiet rho=0.02 (super super weak positive correlation)
cor.test(filtered.ed$culture.euclidean, filtered.ed$label.euclidean, method="spearman", exact=FALSE)##
## Spearman's rank correlation rho
##
## data: filtered.ed$culture.euclidean and filtered.ed$label.euclidean
## S = 255073705, p-value = 0.3562
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.02707779
# For checking outliers
filtered.ed %>% ggplot(aes(x=culture.euclidean, y=label.euclidean, color=label)) +
geom_point() ## Warning: Removed 157 rows containing missing values (geom_point).
# For individual games
filtered.ed %>% ggplot(aes(x=culture.euclidean, y=label.euclidean)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~ label)## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Removed 157 rows containing missing values (geom_point).
5.3.4.4 Correlation on the game labels
# Correlation
# This is calculating the distance between label vectors (each dot is represented by labels)
# labels currently using the binary indicator variable
countries <- unique(df$country)
gameOrder <- unique(df$game[!is.na(df$game)])
labelOrder <- c("pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn", "stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output.game.corr <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
for(game in gameOrder) {
gameSpace <- data.frame(matrix(ncol=0, nrow=24))
for(country in countries) {
df.corr <- final.df %>% filter(country == !!country & game == !!game)
df.corr <- df.corr %>% arrange(factor(label, levels=labelOrder))
v <- df.corr$mrp_pred
gameSpace <- cbind(gameSpace, as.data.frame(v))
names(gameSpace)[names(gameSpace) == "v"] <- country
}
output.game.corr<- rbind(calCorr(gameSpace, game), output.game.corr)
}For Correlation vs Cultural Distance Index (Euclidean distance), we’d want to see a negative trend, meaning the more correlated two countries are in terms of game labels, the farther apart.
# Plot the results
filtered.corr <- cultureDf %>% left_join(output.game.corr, by=c("countryA", "countryB")) %>%
filter(countryA != countryB) %>% # remove duplicates (countryA, countryB) vs (countryB, countryA)
filter(!duplicated(paste0(pmax(countryA, countryB), pmin(countryA, countryB), label)))
# Overall for all games (~11) * country pairs (120) points
filtered.corr %>% ggplot(aes(x=culture.euclidean, y=label.corr)) +
geom_point() +
geom_smooth(method="lm")## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Warning: Removed 157 rows containing missing values (geom_point).
# Spearman correlation, the coefficiet rho=0.02 (super super weak positive correlation)
cor.test(filtered.corr$culture.euclidean, filtered.corr$label.corr, method="spearman", exact=FALSE)##
## Spearman's rank correlation rho
##
## data: filtered.corr$culture.euclidean and filtered.corr$label.corr
## S = 273969839, p-value = 0.1251
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.04499733
# For checking outliers
filtered.corr %>% ggplot(aes(x=culture.euclidean, y=label.corr, color=label)) +
geom_point() ## Warning: Removed 157 rows containing missing values (geom_point).
# For individual games
filtered.corr %>% ggplot(aes(x=culture.euclidean, y=label.corr)) +
geom_point() +
geom_smooth(method="lm") +
facet_wrap(~ label)## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 157 rows containing non-finite values (stat_smooth).
## Removed 157 rows containing missing values (geom_point).
6 Figure 3: If label outcome differences are found across countries, is there a data-backed explanation relating to culture?
To answer this question, we use the Hofstede’s theory of cultural dimensions. This question answers whether culturally “closer” countries have more similar survey results.
- Define ‘culture’ using Hofstede’s cultural dimension map and calculate cultural similarity scores between pairs of countries
- Calculate correlations between sampled survey responses for individuals across pairs of countries
library(tidyverse)
library(tidytext)
library(data.table)
library("ggpubr")
source("read.R")
theme_set(
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size=8))
)
labels <- unique(df$label)# 1. Find the pairwise correlation between countries based on the Hofstede's framework
# correlation based on six metrics: power.distance, individualism, masculinity, uncertainty
# long.term.orientation, and indulgence
library(reshape)
hofstedeDf <- read.csv("hoftstede.csv")
hofstedeDf <- hofstedeDf %>% select(-X) %>%
mutate(power.distance = power.distance / 100,
individualism = individualism / 100,
masculinity = masculinity / 100,
uncertainty = uncertainty / 100,
long.term.orientation = long.term.orientation / 100,
indulgence = indulgence / 100) # scale the hofstede score to 0 - 1
corrMatrix <- cor(t(hofstedeDf[, -1])) # remove the first column with countries
corrMatrix[upper.tri(corrMatrix, diag = TRUE)] <- NA
rownames(corrMatrix) <- colnames(corrMatrix) <- hofstedeDf$country
corrMatrix <- na.omit(reshape::melt(t(corrMatrix)))
corrMatrix <- corrMatrix[ order(corrMatrix$X1, corrMatrix$X2), ]
hofstedeCorr <- corrMatrix %>% select(X1, X2, value) %>% dplyr::rename(countryA=X1, countryB=X2, hofstede.corr=value)
hofstedeCorr$countryA <- as.character(hofstedeCorr$countryA)
hofstedeCorr$countryB <- as.character(hofstedeCorr$countryB)
hofstedeCorr <- transform(hofstedeCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
head(hofstedeCorr)## countryA countryB hofstede.corr
## 17 India Singapore 0.44105595
## 33 India US -0.40050181
## 49 India Nigeria -0.06628162
## 65 India South.Africa -0.48927712
## 81 Argentina India -0.35943139
## 97 Brazil India 0.18844511
# Calculate correlations between sampled survey responses for individuals across pairs of countries
rnormt <- function(n, range, mu, s) {
# range is a vector of two values
F.a <- pnorm(min(range), mean = mu, sd = s)
F.b <- pnorm(max(range), mean = mu, sd = s)
u <- runif(n, min = F.a, max = F.b)
return(qnorm(u, mean = mu, sd = s))
}
countryCorr <- df %>% select(Response.ID, game, country, label, answer)
# Responses that are lower than 5 participants per game
summary <- countryCorr %>% filter(label == "learning_curve") %>%
group_by(country, game) %>%
summarise(n=n(), mean=mean(answer), sd=sd(answer)) %>%
ungroup() %>%
filter(n < 5)
# games that we will ignore altogether
gameTooFew <- unique(summary$game)
gameTooFew## [1] "The Sims 3" "Animal Crossing: New Horizons"
## [3] "Mario Kart 8" "Stardew Valley"
## [5] "Candy Crush"
# This function takes a matrix where columns are countries
# This function returns the correlation between countries
# countryA, countryB, label.corr
corr.df <- function(full, label) {
mx <- cor(full, use = "complete.obs")
mx[upper.tri(mx, diag = TRUE)] <- NA
rownames(mx) <- colnames(mx)
mx <- na.omit(reshape::melt(t(mx)))
mx <- mx[ order(mx$X1, mx$X2), ]
labelCorr <- mx %>% select(X1, X2, value)
labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
labelCorr$countryA <- as.character(labelCorr$countryA)
labelCorr$countryB <- as.character(labelCorr$countryB)
labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
labelCorr$label <- label
return(labelCorr)
}6.1 Game (11) * Label (28) vector correlation vs. Hofstede
Result: All the label correlation are very high, so this might not be what we want
# This step forms a game vector with 11 dimension per label/country with means
gameVector = function(inputrnormtA, inputCountry, inputLabel, lower, upper) {
inputCountry
inputLabel
dfA <- df %>% filter(label == !!inputLabel) %>%
mutate(answer = (1-0) * (answer-!!lower) / (!!upper-!!lower) + 0) %>% # the filter and mutate_at order is important and cannot be switched
filter(country == !!inputCountry) %>%
# filter(!game %in% gameTooFew) %>%
group_by(game) %>%
dplyr::summarise(mean=mean(answer), sd=sd(answer))
filterDf <- df %>% filter(label == !!inputLabel & country == !!inputCountry) %>%
group_by(game) %>%
filter(n() >= 5) %>%
count()
dfA <- dfA %>% mutate(mean=ifelse(!game %in% filterDf$game, NA, dfA$mean))
rnormtA <- inputrnormtA
return(c(rnormtA, c(dfA$mean)))
}countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
"heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
"stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
labelSpace <- data.frame(matrix(ncol=0, nrow=11))
for(label in labels) {
labelSpace <- data.frame(matrix(ncol=0, nrow=11))
for(country in countries) {
v <- c()
if(label %in% c("control_complexity", "learning_curve", "replayability")) {
v <- gameVector(v, country, label, lower=1, upper=3)
} else if (label == "difficulty") {
v <- gameVector(v, country, label, lower=1, upper=4)
} else {
v <- gameVector(v, country, label, lower=0, upper=1)
}
labelSpace <- cbind(labelSpace, as.data.frame(v))
names(labelSpace)[names(labelSpace) == "v"] <- country
}
output <- rbind(labelSpace, output)
}# Plot label correlation against the Hofstede scores
# Create a matrix where columns are countryA, countryB, and label.corr
mx <- cor(labelSpace, use = "pairwise.complete.obs")
mx[upper.tri(mx, diag = TRUE)] <- NA
rownames(mx) <- colnames(mx)
mx <- na.omit(reshape::melt(t(mx)))## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
## Warning in type.convert.default(X[[i]], ...): 'as.is' should be specified by the
## caller; using TRUE
mx <- mx[ order(mx$X1, mx$X2), ]
labelCorr <- mx %>% select(X1, X2, value)
labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
labelCorr$countryA <- as.character(labelCorr$countryA)
labelCorr$countryB <- as.character(labelCorr$countryB)
labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))# Plotting the correlation
combined <- hofstedeCorr %>% left_join(labelCorr, by=c("countryA", "countryB"))
ggscatter(combined, x = "hofstede.corr", y = "label.corr",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "spearman",
xlab = "Hofstede", ylab = "Label")## `geom_smooth()` using formula 'y ~ x'
6.2 Game (11) vector correlation vs. Hofstede
This one compare a 11-dimensional vector between countries for each label. So each dot is a comparison between two countries for a label (e.g., US vs Japan for control_complexity)
updateVector = function(inputCountry, inputLabel, lower=0, upper=1) {
inputCountry
inputLabel
dfA <- df %>% filter(label == {{ inputLabel }}) %>%
mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>% # the filter and mutate_at order is important and cannot be switched
filter(country == {{ inputCountry }}) %>%
filter(!game %in% gameTooFew) %>%
group_by(game) %>%
dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE))
rnormtA <- c()
for(row in 1:nrow(dfA)) {
meanA <- as.double(dfA[row, "mean"])
sdA <- as.double(dfA[row, "sd"])
rnormtA <- c(rnormtA, meanA)
}
return(rnormtA)
}countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
"heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
"stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
for(label in labels) {
v <- c()
labelSpace <- data.frame(matrix(ncol = 0, nrow = 600))
for(country in countries) {
if(label %in% c("control_complexity", "learning_curve", "replayability")) {
v <- updateVector(country, label, lower=1, upper=3)
} else if (label == "difficulty") {
v <- updateVector(country, label, lower=1, upper=4)
} else {
v <- updateVector(country, label, lower=0, upper=1)
}
labelSpace <- cbind(labelSpace, as.data.frame(v))
names(labelSpace)[names(labelSpace) == "v"] <- country
}
#break
output <- rbind(corr.df(labelSpace, label), output)
}# Now plotting the two correlations
combined <- hofstedeCorr %>% left_join(output, by=c("countryA", "countryB"))
ggscatter(combined, x = "hofstede.corr", y = "label.corr",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "spearman",
xlab = "Hofstede", ylab = "Label")## `geom_smooth()` using formula 'y ~ x'
6.3 Sampling approach 1
For each country and label, we sample 100 points based on mean and sd. Overall, each dot is a 6-dimensional game vector for each label between two countries. (control_complexity: [PUBG, …])
updateSingleVector = function(inputCountry, inputLabel, lower=0, upper=1) {
inputCountry
inputLabel
dfA <- df %>% filter(label == {{ inputLabel }}) %>%
mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>% # the filter and mutate_at order is important and cannot be switched
filter(country == {{ inputCountry }}) %>%
filter(!game %in% gameTooFew) %>%
group_by(game) %>%
dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE))
rnormtA <- c()
for(row in 1:nrow(dfA)) {
meanA <- as.double(dfA[row, "mean"])
sdA <- as.double(dfA[row, "sd"])
gameA = meanA
if(sdA == 0) {
gameA <- rep(meanA, 100)
} else {
gameA <- rnormt(100, c(0, 1), meanA, sdA)
}
rnormtA <- c(rnormtA, gameA)
}
return(rnormtA)
}countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space",
"heroic", "real.world", "violent", "action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
"stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
output <- data.frame(countryA=c(), countryB=c(), label.corr=c(), label=c())
for(label in labels) {
v <- c()
labelSpace <- data.frame(matrix(ncol = 0, nrow = 600))
for(country in countries) {
if(label %in% c("control_complexity", "learning_curve", "replayability")) {
v <- updateSingleVector(country, label, lower=1, upper=3)
} else if (label == "difficulty") {
v <- updateSingleVector(country, label, lower=1, upper=4)
} else {
v <- updateSingleVector(country, label, lower=0, upper=1)
}
labelSpace <- cbind(labelSpace, as.data.frame(v))
names(labelSpace)[names(labelSpace) == "v"] <- country
}
#break
output <- rbind(corr.df(labelSpace, label), output)
}# Now plotting the two correlations
combined <- hofstedeCorr %>% left_join(output, by=c("countryA", "countryB"))
ggscatter(combined, x = "hofstede.corr", y = "label.corr",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "spearman",
xlab = "Hofstede", ylab = "Label")## `geom_smooth()` using formula 'y ~ x'
6.4 Sampling 2: Game (11) * Label (28) vector correlation + sampling
Each vector is 11 games * 28 labels * 100
# Compare "label space" vector
updateVector <- function(inputCountry, inputrnormtA, inputLabel, lower=0, upper=1) {
# update the vector of a country based on a specific label
# Args:
# country: string of countries
# label: string of the label of interest
# lower, upper: the lower and upper bound (0-1 for binary, 0-3 for control_complexity etc, 0-4 for difficulty)
# Returns:
# an updated vector in the label space
inputLabel
inputCountry
if(upper - lower == 0) {
print("Something happened")
}
dfA <- df %>% filter(label == {{ inputLabel }}) %>%
mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>% # the filter and mutate_at order is important and cannot be switched
filter(country == {{ inputCountry }}) %>%
filter(!game %in% gameTooFew) %>%
group_by(game) %>%
dplyr::summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE))
rnormtA <- inputrnormtA
for(row in 1:nrow(dfA)) {
meanA <- as.double(dfA[row, "mean"])
sdA <- as.double(dfA[row, "sd"])
if(sdA == 0) {
gameA <- rep(meanA, 100)
} else {
gameA <- rnormt(100, c(0, 1), meanA, sdA)
}
rnormtA <- c(rnormtA, gameA)
}
return(rnormtA)
}
labelSpace <- data.frame(matrix(ncol = 0, nrow = 28*600))
countries <- unique(df$country)
labels <- c("control_complexity", "learning_curve", "difficulty", "replayability", "pacifist", "made.for.kids", "cozy", "zen", "fantasy", "space", "heroic", "real.world", "violent",
"action", "emotional", "comedic", "experimental", "strategy", "grinding", "anime", "hand.drawn",
"stylized", "action.motivation", "social", "mastery", "achievement", "immersion", "creativity")
for(country in countries) {
v <- c()
for(label in labels) {
if(label %in% c("control_complexity", "learning_curve", "replayability")) {
v <- updateVector(country, v, label, lower=1, upper=3)
} else if (label == "difficulty") {
v <- updateVector(country, v, label, lower=1, upper=4)
} else {
v <- updateVector(country, v, label, lower=0, upper=1)
}
}
labelSpace <- cbind(labelSpace, as.data.frame(v))
names(labelSpace)[names(labelSpace) == "v"] <- country
}library(corrplot)
mx <- cor(labelSpace, use = "pairwise.complete.obs")
corrplot(mx, type = "upper", order = "FPC", tl.cex=2,
tl.col = "black", tl.srt = 45)# Plot label correlation against the Hofstede scores
mx[upper.tri(mx, diag = TRUE)] <- NA
rownames(mx) <- colnames(mx)
mx <- na.omit(reshape::melt(t(mx)))
mx <- mx[ order(mx$X1, mx$X2), ]
labelCorr <- mx %>% select(X1, X2, value)
labelCorr <- labelCorr %>% dplyr::rename(countryA=X1, countryB=X2, label.corr=value)
labelCorr$countryA <- as.character(labelCorr$countryA)
labelCorr$countryB <- as.character(labelCorr$countryB)
labelCorr <- transform(labelCorr, countryA = pmin(countryA, countryB), countryB=pmax(countryA, countryB))
head(labelCorr)## countryA countryB label.corr
## 17 India US 0.3071151
## 33 Singapore US 0.3100672
## 49 South.Africa US 0.3755622
## 65 Nigeria US 0.2735908
## 81 Argentina US 0.3725302
## 97 Brazil US 0.3708676
- This finding shows that there is no clear trend in the correlation between Hofstede’s dimensions of culture and the label correlations. It seems that the label correlation is roughly the same across these countries. In other words, the Hofstede’s dimensions of culture, which was observed in organizational culture at IBM, doesn’t seem to explain the difference well.
# Now plotting the two correlations
combined <- hofstedeCorr %>% left_join(labelCorr, by=c("countryA", "countryB"))
ggplot(aes(x=hofstede.corr, y=label.corr), data=combined) +
geom_point() +
ylim(c(0,1)) +
geom_smooth(method='lm')## `geom_smooth()` using formula 'y ~ x'
cor(combined$hofstede.corr, combined$label.corr, method = "spearman")## [1] 0.0307313
The following takes a similar approach, but looks at the individual game level.
7 Figure 4: How similar are label choices for bilingual speakers when asked questions in their native language versus English?
library(tidyverse)
library(tidytext)
library(data.table)
source("read.R")
theme_set(
theme_bw() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1, size=8))
)
df <- df %>% filter(!label %in% c("NA.positive.opinion", "NA.negative.opinion", "NA.feeling", "NA.art"))We ran our surveys in both English and non-English languages on bilingual speakers within each non-English speaking country.
rnormt <- function(n, range, mu, s) {
if(s == 0) {
return(c(NaN))
}
# range is a vector of two values
F.a <- pnorm(min(range), mean = mu, sd = s)
F.b <- pnorm(max(range), mean = mu, sd = s)
u <- runif(n, min = F.a, max = F.b)
return(qnorm(u, mean = mu, sd = s))
}# For some weird reason, I need to output the first four variable
compareLanguage <- function(languageA, languageB, inputCountries, label, lower=0, upper=1) {
languageA
languageB
inputCountries
label
dfA = df %>% dplyr::filter(label == {{ label }}) %>% # select the label
mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>% # rescale the Likert scale answer to 0-1 (e.g., 1-3, 1-4 to 0-1)
dplyr::filter(Language == {{ languageA }} & country %in% inputCountries) %>%
group_by(game) %>%
summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) %>%
filter(!is.na(sd)) # in case there is only one response in a country
# same operation for languageB
dfB = df %>% dplyr::filter(label == {{ label }}) %>% # select the label
mutate(answer = (1-0) * (answer-lower) / (upper-lower) + 0) %>% # rescale the Likert scale answer to 0-1 (e.g., 1-3, 1-4 to 0-1)
dplyr::filter(Language == {{ languageB }} & country %in% inputCountries) %>%
group_by(game) %>%
summarise(mean=mean(answer, na.rm=TRUE), sd=sd(answer, na.rm=TRUE)) %>%
filter(!is.na(sd)) # in case there is only one response in a country
rnormtA <- c()
rnormtB <- c()
for(row in unique(df$game)) {
meanA <- as.double(dfA[dfA$game == row, "mean"])
meanB <- as.double(dfB[dfB$game == row, "mean"])
sdA <- as.double(dfA[dfA$game == row, "sd"])
sdB <- as.double(dfB[dfB$game == row, "sd"])
if(any(is.na(meanA)) | any(is.na(meanB))) next;
if(any(is.na(sdA)) | any(is.na(sdB))) next;
if(any(sdA == 0) | any(sdB == 0)) next;
gameA <- rnormt(1000, c(0, 1), meanA, sdA)
gameB <- rnormt(1000, c(0, 1), meanB, sdB)
rnormtA <- c(rnormtA, gameA)
rnormtB <- c(rnormtB, gameB)
}
# no game is available
if(length(rnormtA) == 0) {
return(data.frame(languageA = languageA, languageB = languageB, estimate=NaN, p=NaN, label=label))
}
cor.res = cor.test(rnormtA, rnormtB, method=c("pearson"), use = "complete.obs")
return(data.frame(languageA = languageA, languageB = languageB, estimate=cor.res$estimate, p=cor.res$p.value, label=label))
}matrix <- combn(unique(df$Language), 2) # get the combination of languages
new.df <- data.frame(languageA=c(), languageB=c(), estimate=c(), p=c(), label=c())
for(col in 1:ncol(matrix)) {
# for each combination (English vs. language X
l1 <- matrix[1, col] # English
if(l1 != "English") break
l2 <- matrix[2, col] # language X
language.countries <- df %>% filter(Language == l2) # find the correpondings countries to language X
for(label in c("control_complexity", "learning_curve", "replayability")) {
comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), label, lower=1, upper=3)
new.df <- rbind(new.df, comparison)
}
comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), "difficulty", lower=1, upper=4)
new.df <- rbind(new.df, comparison)
for(label in labels[!labels %in% c("control_complexity", "learning_curve", "difficulty", "replayability")]){
comparison <- compareLanguage(l1, l2, c(unique(language.countries$country)), label, lower=0, upper=1)
new.df <- rbind(new.df, comparison) # update the new.df
}
}- This result looks very cool as it indicates that the current translation of tags might be leaning toward popular markets. Japanese, Arabic, Greek markets are poorly translated.
new.df %>% filter(estimate < 0)## languageA languageB estimate p label
## cor81 English German -0.0015186309 0.87346612 achievement
## cor88 English Greek -0.0028003688 0.87814680 pacifist
## cor90 English Greek -0.0127911144 0.28460398 cozy
## cor94 English Greek -0.0036384955 0.79701126 real.world
## cor97 English Greek -0.0032613144 0.81766231 emotional
## cor101 English Greek -0.0024696280 0.84832425 grinding
## cor103 English Greek -0.0103806079 0.42143593 hand.drawn
## cor108 English Greek -0.0052058916 0.64153084 achievement
## cor109 English Greek -0.0151299980 0.28477862 immersion
## cor110 English Greek -0.0084885782 0.64210821 creativity
## cor113 English Japanese -0.0193274718 0.22166651 replayability
## cor115 English Japanese -0.0720159174 0.02275921 pacifist
## cor116 English Japanese -0.0234320142 0.29491504 made.for.kids
## cor117 English Japanese -0.0082740802 0.71152920 cozy
## cor118 English Japanese -0.0108711495 0.55170523 zen
## cor120 English Japanese -0.0190434275 0.39466231 heroic
## cor124 English Japanese -0.0063587958 0.84082892 emotional
## cor131 English Japanese -0.0023162515 0.91754925 action.motivation
## cor132 English Japanese -0.0008932378 0.96099559 social
## cor133 English Japanese -0.0454120321 0.04228891 mastery
## cor134 English Japanese -0.0261195416 0.24298013 achievement
## cor136 English Japanese -0.0132795981 0.40110463 creativity
## cor154 English Polish -0.0147580221 0.25304875 grinding
## cor155 English Polish -0.0246806752 0.17654853 hand.drawn
## cor158 English Polish -0.0093197112 0.47043775 social
## cor163 English Arabic -0.0171182883 0.58872101 control_complexity
## cor164 English Arabic -0.0154268056 0.62607602 learning_curve
## cor166 English Arabic -0.0584431924 0.06468865 difficulty
## cor169 English Arabic -0.0022168611 0.94418115 violent
## cor170 English Arabic -0.0368267869 0.24462532 emotional
8 Compare prediction with ESRB ratings
In this section, we aim to answer the question: when training the ML model, should we get data from all over the world? Or should we get data from a specific country?
We tried two approaches:
- Approach 1: y ~ game, GLM model using the survey response
- Approach 2: multi-level regression and post-stratification on gender and age.
In each approach, we’d like to see (1) if the prediction would be different if we train the model using the complete data or the data from a specific country; (2) if the prediction would be different from the ESRB. Please see the bottom of each approach section for observation.
8.1 Approach 1: Linear Regression
Let’s use violent as an example.
violent <- df %>% filter(label == "violent") %>% mutate(answer = factor(answer))
head(violent)## # A tibble: 6 × 17
## Response.ID game country GENDER AGE ageCut native Langu…¹ region exper…²
## <chr> <chr> <fct> <chr> <int> <fct> <fct> <chr> <fct> <dbl>
## 1 US_English_No… PUBG US Male 45 45-55 Engli… English US 5
## 2 US_English_No… PUBG US Male 38 35-44 Engli… English US 6
## 3 US_English_No… PUBG US Male 45 45-55 Engli… English US 6
## 4 US_English_No… PUBG US Male 51 45-55 Engli… English US 6
## 5 US_English_No… PUBG US Male 40 35-44 Engli… English US 6
## 6 US_English_No… PUBG US Male 48 45-55 Engli… English US 6
## # … with 7 more variables: hardcore <fct>, accessibility <fct>, labeler <chr>,
## # ambassador <chr>, label <fct>, answer <fct>, gender <chr>, and abbreviated
## # variable names ¹Language, ²experience
## # ℹ Use `colnames()` to see all variable names
# Let's build a model to see the probability
model <- glm(answer ~ game, data=violent, family=binomial)
# The default game is "Animal Crossing: New Horizons"
# Note that the coefficient is the likelihood
summary(model) ##
## Call:
## glm(formula = answer ~ game, family = binomial, data = violent)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8512 -0.3610 -0.1456 0.6305 3.1386
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.91815 0.35481 -13.861 < 2e-16 ***
## gameCall of Duty: Vanguard 5.85696 0.35885 16.322 < 2e-16 ***
## gameCandy Crush 0.03409 0.54203 0.063 0.94985
## gameElden Ring 5.73004 0.35829 15.993 < 2e-16 ***
## gameFIFA22 0.85590 0.42359 2.021 0.04332 *
## gameGrand Theft Auto V 6.43288 0.35781 17.978 < 2e-16 ***
## gameMario Kart 8 1.14260 0.39184 2.916 0.00355 **
## gameMinecraft 2.22020 0.36376 6.103 1.04e-09 ***
## gamePUBG 5.74748 0.36014 15.959 < 2e-16 ***
## gameStardew Valley 0.37680 0.46666 0.807 0.41942
## gameThe Sims 3 1.45242 0.45301 3.206 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22198 on 16800 degrees of freedom
## Residual deviance: 11084 on 16790 degrees of freedom
## AIC: 11106
##
## Number of Fisher Scoring iterations: 7
# We are interested in the probability of choosing one label over the other
newdata = data.frame(game = sort(unique(violent$game)))
newdata$prob <- predict(model, newdata, type="response")
# Note GTA V has a very high probability of getting a "violent label"
newdata## game prob
## 1 Animal Crossing: New Horizons 0.007259528
## 2 Call of Duty: Vanguard 0.718859139
## 3 Candy Crush 0.007509387
## 4 Elden Ring 0.692510549
## 5 FIFA22 0.016918967
## 6 Grand Theft Auto V 0.819760101
## 7 Mario Kart 8 0.022410660
## 8 Minecraft 0.063093881
## 9 PUBG 0.696212732
## 10 Stardew Valley 0.010546500
## 11 The Sims 3 0.030303030
# Is it the case if we are in a different country?
violent.JP <- violent %>% filter(country == "Japan")
model.JP <- glm(answer ~ game, data=violent.JP, family=binomial)
summary(model.JP)##
## Call:
## glm(formula = answer ~ game, family = binomial, data = violent.JP)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.48638 -0.34821 -0.00008 0.30502 2.66119
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.957e+01 1.218e+03 -0.016 0.987
## gameCall of Duty: Vanguard 2.052e+01 1.218e+03 0.017 0.987
## gameCandy Crush 1.516e-08 3.032e+03 0.000 1.000
## gameElden Ring 1.989e+01 1.218e+03 0.016 0.987
## gameFIFA22 1.534e-08 2.596e+03 0.000 1.000
## gameGrand Theft Auto V 2.261e+01 1.218e+03 0.019 0.985
## gameMario Kart 8 1.605e+01 1.218e+03 0.013 0.989
## gameMinecraft 1.813e+01 1.218e+03 0.015 0.988
## gamePUBG 2.165e+01 1.218e+03 0.018 0.986
## gameStardew Valley 1.679e+01 1.218e+03 0.014 0.989
## gameThe Sims 3 1.762e+01 1.218e+03 0.014 0.988
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 600.89 on 479 degrees of freedom
## Residual deviance: 272.85 on 469 degrees of freedom
## AIC: 294.85
##
## Number of Fisher Scoring iterations: 18
# We are interested in the probability of choosing one label over the other
newdata.JP = data.frame(game = sort(unique(violent$game)))
newdata.JP$prob <- predict(model.JP, newdata.JP, type="response")
# Note GTA V has a very high probability of getting a "violent label"
newdata.JP## game prob
## 1 Animal Crossing: New Horizons 3.181005e-09
## 2 Call of Duty: Vanguard 7.222222e-01
## 3 Candy Crush 3.181005e-09
## 4 Elden Ring 5.797101e-01
## 5 FIFA22 3.181005e-09
## 6 Grand Theft Auto V 9.545455e-01
## 7 Mario Kart 8 2.898551e-02
## 8 Minecraft 1.927711e-01
## 9 PUBG 8.888889e-01
## 10 Stardew Valley 5.882353e-02
## 11 The Sims 3 1.250000e-01
We can notice Elden Ring and Minecraft has pretty different predicted probability from above. Let’s contrast the probability when trained on different countries.
countries <- sort(unique(as.character(violent$country)))
games <- sort(unique(violent$game))
mod.df <- newdata %>% dplyr::rename("total" = "prob")
for(country in countries) {
data.country <- violent %>% filter(country == {{ country }})
model.country <- glm(answer ~ game, data=data.country, family=binomial)
newdata.country = data.frame(game = games)
newdata.country$prob <- predict(model.country, newdata.country, type="response")
names(newdata.country)[names(newdata.country) == "prob"] <- country
newdata.country <- newdata.country %>% select(-game)
mod.df <- cbind(mod.df, newdata.country)
}# Let's plot them by games
mod.df %>% pivot_longer(!game, names_to="country", values_to="prob") %>%
mutate(country = factor(country, levels=c(countries, "total"))) %>%
ggplot(aes(x=country, y=prob, color=country)) +
geom_point() +
facet_wrap(~game, scales="free", ncol=3) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.75, linetype="dashed", color = "red") +
geom_hline(data = . %>% filter(country == "total"), mapping = aes(yintercept = prob), linetype="dashed", color = "blue") +
theme(legend.position = "none")Findings
In the plots above, the red dashed line is the y=0.75 threshold. The blue dashed line is y=prob using the total data (the total on the x-axis). We can observe:
- Only the US would label Elden Ring “violent”. The US is above the total and
y=0.75line whereas all the other countries are below. - Similar discrepancy happens in Call of Duty, Grand Theft Auto V, PUBG.
- According to ESRB: GTA V, Elden ring, Call of Duty, PUBG, Minecraft are all violent. A number of countries would be labeling these games differently if we look at responses from specific countries. (For example, all countries except US and Greece for CoD, all countries except US for Elden Ring, all countries except US Greece and Japan for GTA, all countries except US and Japan.)
- Minecraft is labeled as “fantasy violence” but none of the country label it as “violent”.
ESRB rating - GTA-V: https://www.esrb.org/ratings/33073/grand-theft-auto-v/ - Elden Ring: https://www.esrb.org/ratings/38263/elden-ring/ - Call of Duty: https://www.esrb.org/ratings/9518/call-of-duty/ - PUBG: https://www.esrb.org/ratings/10032960/playerunknowns-battlegrounds/ - Minecraft: https://www.esrb.org/ratings/35148/minecraft/
8.2 Approach 2: Post-stratification
The above approach uses a very simple GLM for the binary response. Let’s check if we can find similar pattern with post-stratificaiton. Please refer to Section Figure X: Poststratification and Correlations for details to build the model.
# Recall that we have the final.df dataframe which includes the post-stratified probability
post.df <- final.df %>% select(country, game, label, mrp_pred) %>%
filter(label == "violent")
head(post.df)## country game label mrp_pred
## 1 South.Africa The Sims 3 violent 0.02992721
## 2 Colombia The Sims 3 violent 0.03004292
## 3 Greece The Sims 3 violent 0.02986638
## 4 Korea The Sims 3 violent 0.02987085
## 5 India The Sims 3 violent 0.03005474
## 6 Japan The Sims 3 violent 0.02986290
post.df %>% mutate(country = factor(country, levels=c(countries, "total"))) %>% # reorder to make x axis consistent
ggplot(aes(x=country, y=mrp_pred, color=country)) +
geom_point() +
facet_wrap(~game, scales="free", ncol=3) +
ylim(c(0, 1)) +
geom_hline(yintercept=0.75, linetype="dashed", color = "red") +
geom_hline(data=mod.df %>%
pivot_longer(!game, names_to="country", values_to="prob") %>%
mutate(country = factor(country, levels=c(countries, "total"))) %>%
filter(country == "total"),
mapping = aes(yintercept = prob), linetype="dashed", color = "blue") +
theme(legend.position = "none")Findings
In the plots above, the red dashed line is y=0.75 threshold again. The blue dashed line is y=prob using the total data from the GLM regression in approach 1. We can observe:
- The probability for the US to label games as violent declined.
- South Africa tend to be the outlier for Call of Duty, Elden Ring, GTA.
- Countries, such as Chile, would not rate PUBG, GTA, or Call of Duty as “violent”; a lot of countries would not rate Elden Ring as “violent”. These are all in contrast with the ESRB ratings.
- Different countries would still label games a bit differently for “violent”. The difference doesn’t seem to be mitigated by post-stratification (?).